A farewell to Bonferroni : the problems of low statistical power and publication bias

نویسنده

  • Shinichi Nakagawa
چکیده

Recently, Jennions and Møller (2003) carried out a metaanalysis on statistical power in the field of behavioral ecology and animal behavior, reviewing 10 leading journals including Behavioral Ecology. Their results showed dismayingly low average statistical power (note that a meta-analytic review of statistical power is different from post hoc power analysis as criticized in Hoenig and Heisey, 2001). The statistical power of a null hypothesis (Ho) significance test is the probability that the test will reject Ho when a research hypothesis (Ha) is true. Knowledge of effect size is particularly important for statistical power analysis (for statistical power analysis, see Cohen, 1988; Nakagawa and Foster, in press). There are many kinds of effect size measures available (e.g., Pearson’s r, Cohen’s d, Hedges’s g), but most of these fall into one of two major types, namely the r family and the d family (Rosenthal, 1994). The r family shows the strength of relationship between two variables while the d family shows the size of difference between two variables. As a benchmark for research planning and evaluation, Cohen (1988) proposed ‘conventional’ values for small, medium, and large effects: r 1⁄4.10, .30, and .50 and d 1⁄4.20, .50, and .80, respectively (in the way that p values of .05, .01, and .001 are conventional points, although these conventional values of effect size have been criticized; e.g., Rosenthal et al., 2000). The meta-analysis on statistical power by Jennions and Møller (2003) revealed that, in the field of behavioral ecology and animal behavior, statistical power of less than 20% to detect a small effect and power of less than 50% to detect a medium effect existed. This means, for example, that the average behavioral scientist performing a statistical test has a greater probability of making a Type II error (or b) (i.e., not rejecting Ho when Ho is false; note that statistical power is equals to 1 2 b) than if they had flipped a coin, when an experiment effect is of medium size (i.e., r 1⁄4 .30, d 1⁄4 .50). Here, I highlight and discuss an implication of this low statistical power on one of the most widely used statistical procedures, Bonferroni correction (Cabin and Mitchell, 2000). Bonferroni corrections are employed to reduce Type I errors (i.e., rejecting Ho when Ho is true) when multiple tests or comparisons are conducted. Two kinds of Bonferroni procedures are commonly used. One is the standard Bonferroni procedure, where a modified significant criterion (a/k where k is the number of statistical tests conducted on given data) is used. The other is the sequential Bonferroni procedure, which was introduced by Holm (1979) and popularized in the field of ecology and evolution by Rice (1989) (see these papers for the procedure). For example, in a recent volume of Behavioral Ecology (vol. 13, 2002), nearly one-fifth of papers (23 out of 117) included Bonferroni corrections. Twelve articles employed the standard procedure while 11 articles employed the sequential procedure (10 citing Rice, 1989, and one citing Holm, 1979). A serious problem associated with the standard Bonferroni procedure is a substantial reduction in the statistical power of rejecting an incorrect Ho in each test (e.g., Holm, 1979; Perneger, 1998; Rice, 1989). The sequential Bonferroni procedure also incurs reduction in power, but to a lesser extent (which is the reason that the sequential procedure is used in preference by some researchers; Moran, 2003). Thus, both procedures exacerbate the existing problem of low power, identified by Jennions and Møller (2003). For example, suppose an experiment where both an experimental group and a control group consist of 30 subjects. After an experimental period, we measure five different variables and conduct a series of t tests on each variable. Even prior to applying Bonferroni corrections, the statistical power of each test to detect a medium effect is 61% (a 1⁄4 .05), which is less than a recommended acceptable 80% level (Cohen, 1988). In the field of behavioral ecology and animal behavior, it is usually difficult to use large sample sizes (in many cases, n , 30) because of practical and ethical reasons (see Still, 1992). When standard Bonferroni corrections are applied, the statistical power of each t test drops to as low as 33% (to detect a medium effect at a/5 1⁄4 .01). Although sequential Bonferroni corrections do not reduce the power of the tests to the same extent, on average (33–61% per t test), the probability of making a Type II error for some of the tests (b 1⁄4 1 2 power, so 39–66%) remains unacceptably high. Furthermore, statistical power would be even lower if we measured more than five variables or if we were interested in detecting a small effect. Bonferroni procedures appear to raise another set of problems. There is no formal consensus for when Bonferroni procedures should be used, even among statisticians (Perneger, 1998). It seems, in some cases, that Bonferroni corrections are applied only when their results remain significant. Some researchers may think that their results are ‘more significant’ if the results pass the rigor of Bonferroni corrections, although this is logically incorrect (Cohen, 1990, 1994; Yoccoz, 1991). Many researchers are already reluctant to report nonsignificant results ( Jennions and Møller, 2002a,b). The wide use of Bonferroni procedures may be aggravating the tendency of researchers not to present nonsignificant results, because presentation of more tests with nonsignificant results may make previously ‘significant’ results ‘nonsignificant’ under Bonferroni procedures. The more detailed research (i.e., research measuring more variables) researchers do, the less probability they have of finding significant results. Moran (2003) recently named this paradox as a hyper-Red Queen phenomenon (see the paper for more discussion on problems with the sequential method). Imagine that we conduct a study where we measure as many relevant variables as possible, 10 variables, for example. We find only two variables statistically significant. Then, what should we do? We could decide to write a paper highlighting these two variables (and not reporting the other eight at all) as if we had hypotheses about the two significant variables in the first place. Subsequently, our paper would be published. Alternatively, we could write a paper including all 10 variables. When the paper is reviewed, referees might tell us that there were no significant results if we had ‘appropriately’ employed Bonferroni corrections, so that our study would not be advisable for publication. However, the latter paper is Behavioral Ecology Vol. 15 No. 6: 1044–1045 doi:10.1093/beheco/arh107 Advance Access publication on June 30, 2004

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

کاربرد روش‌های شناسایی تورش انتشار برای فراتحلیل در ارزیابی تاثیر داروی آلبندازول در درمان مبتلایان به آسکاریس و تریکوسفال

 Background : Meta analysis is a statistical method to combine the findings of a set of large number of published individual studies and re-analyse them. The use of meta-analysis methods in medical research has been increased, noticeably, in resent years. However, one of the major shortcomings in such analysis is that the researcher, could not access all conducted studies in the area of concern...

متن کامل

Noise Equivalent Power Optimization of Graphene- Superconductor Optical Sensors in the Current Bias Mode

In this paper, the noise equivalent power (NEP) of an optical sensor based ongraphene-superconductor junctions in the constant current mode of operation has beencalculated. Furthermore, the necessary investigations to optimize the device noise withrespect to various parameters such as the operating temperature, magnetic field, deviceresistance, voltage and current bias have been presented. By s...

متن کامل

Low birth weight in Iran: Implications from a systematic review of the literature and meta-analysis in the period 1999-2017

Background: Low birth weight (LBW), a crucial determinant of neonatal complications, represents a major public health concern worldwide. Epidemiological research is of crucial importance for designing and implementing ad hoc interventions for this issue, helping and guiding decision- and policy-makers in each country to prevent the increased prevalence of LBW in infants through estimating the p...

متن کامل

Bias and Unbias Comparison between Economic Statistical Design T2-VSI and T2-FRS Control Charts

The Hotelling's  control chart, is the most widely used multivariate procedure for monitoring  two or more related quality characteristics, but it’s power lacks the desired performance in detecting small to moderate shifts. Recently, the variable sampling intervals (VSI) control scheme in which the length of successive sampling intervals is determined upon the preceding values has been proved t...

متن کامل

The Effect of CLIL on Language Skills and Components: A Meta-Analysis

Content and Language Integrated Learning (CLIL) has recently been the focus of numerous studies in language education since it aims to overcome the pitfalls of form-focused and meaning-focused instruction by systematically integrating content and language. This meta-analysis aims to synthesize the findings of 22 primary studies that tested the effect of CLIL on language skills and components. G...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2004